Introduction

The primary goal of our project is to determine if we can closely predict movie ratings before critic ratings are released and the films are given an average user score on IMDb. Being able to determine whether or not a movie will be a quality release without critic ratings could address a number of business problems that surround the financing and marketing strategies for different types of movies. Film studios and movie theatres could use these insights to budget accordingly and properly advertise movies to maximize popularity, ratings, and profits. The dataset we will be using contains 5,043 movies, and 28 variables related to the movie.

The primary research questions we wish to ask revolve around predicting a movie’s IMDb score. One of which is will the number of faces in a movie poster correlate with the movie’s rating? Can the multiple actor variables and the director variable be useful in predicting a movie’s IMDb score? Does a movie’s parental guidance rating (meaning R, PG-13, etc.) play into how well a movie scores? Does a movie’s budget help determine a movie’s critical success? Do certain genres tend to receive higher critic scores?

Data Source and Collection

Our data was retrieved as a CSV file from Kaggle:

https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

The data was retrieved by scraping 28 values for 5043 movies from the IMDb website and by using facial recognition software run on 4906 posters. The movies’ release dates span over 100 years and the movies originate from 66 different countries. The variables of the dataset are as follows:

Some of our data will need to be cleansed before moving on to analysis. The variable “director_facebook_likes” is inaccurate in many instances. Foreign films using currencies different from USD may have a misrepresented gross revenue, because the values aren’t equivalent with foreign currencies. This can be addressed by transforming these values by an exchange rate. We would also like to retrieve the release date for each film, since that variable was not included in the original dataset.

The goal of our project is to find connections between variables and the film’s IMDb score. We believe that the key variables are the following: facenumber_in_poster, director_name, actor_1_name, actor_2_name, and actor_3_name. Our hypothesis is that these variables will be crucial in predicting a movie’s overall critic score. We also hypothesize that budget, gross, actors, directors, genre and content rating will play into how much a movie grosses in revenue.

Preperation Work

Clear the environment

rm(list = ls())

Load piping Package

# Note: User may need install packages if they have not previously used them
library(magrittr)

Load Data Manipulation Packages

library(plyr)
library(dplyr)

Read in Data

Load movie CSV file in as a data frame

movie_df <- read.csv("movie_metadata CSV.csv", header = TRUE) 

# Change blank spaces to NA
movie_df[movie_df==""] <- NA

show structure of movie_df

print(str(movie_df))
## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 NA 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
##  $ movie_title              : Factor w/ 4917 levels "#Horror ","[Rec] 2 ",..: 398 2731 3279 3707 3332 1961 3289 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 NA 2714 1970 2163 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 NA 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 NA 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 NA 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 NA 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
## NULL

Data Cleansing and Manipulation

Most of the variables for the IMDb data set were formatted correctly to begin with, but some cleansing was required.

See if there are any observations with NA values for title or score making analysis useles

count(movie_df[!(is.na(movie_df$movie_title)) & !(is.na(movie_df$imdb_score)), ])
## # A tibble: 1 x 1
##       n
##   <int>
## 1  5043

It appears that every observation has a score and title

Convert categorical variables into factors to improve analyses

Color is a factor but is currently a character type, need to change it to a number. 1 = “Color”, 0 = “Black and White”

movie_df$color <- ifelse(movie_df$color == "Color", 1, 0)
movie_df$color <- as.factor(movie_df$color)

Convert number of faces in poster to a factor

movie_df$facenumber_in_poster <- as.factor(movie_df$facenumber_in_poster)
movie_df$title_year <- as.factor(movie_df$title_year)

We created a new variable that calculated the gross profit of a film by subtracting the budget of the film from the gross revenue. This will allow us to perform analyses later on and see how other factors may affect the profit of movies.

create new variable gross_profit by subtracting budget from gross

# Create new vector, positive means budget was lower than gross, negative means that budget exceeds gross
movie_df$gross_profit <- movie_df$gross - movie_df$budget

The values of the “movie_title” variable in our data set had the character “” appended at the end and white space at the end, so we removed the character and trimmed the white space for better readability and to prevent any issues that might have arisen without eliminating the extra white space.

# Remove the  at the end of the titles
movie_df$movie_title <- gsub("Â", "", movie_df$movie_title) 

# trim whitespace
movie_df$movie_title <- trimws(movie_df$movie_title)

Check to see if there are duplicated movie titles

# Show number of unique titles
length(unique(movie_df$movie_title))
## [1] 4916
# Show number of duplicated titles
anyDuplicated(movie_df$movie_title)
## [1] 138

We need to remove duplicates to avoid skew in our findings, but cannot base it just on titles as some movies have remakes. We are chose to remove films that have the same title, director, and release year, as this means that it is likeley they are duplicates

# statement will remove movies if they have duplicate titles, directors, release years and release years
movie_df <- movie_df %>% distinct(movie_title, director_name, title_year, .keep_all = T)

It appears that there are 4919 unique movies; movies with unique titles, directors and release years

Show updated structure of data set

str(movie_df)
## 'data.frame':    4919 obs. of  29 variables:
##  $ color                    : Factor w/ 2 levels "0","1": 2 2 2 2 NA 2 2 2 2 2 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 NA 2714 1970 2163 3018 2941 ...
##  $ facenumber_in_poster     : Factor w/ 19 levels "0","1","2","3",..: 1 1 2 1 1 2 1 2 5 4 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 NA 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 NA 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 NA 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 NA 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : Factor w/ 91 levels "1916","1920",..: 84 82 90 87 NA 87 82 85 90 84 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
##  $ gross_profit             : num  523505847 9404152 -44925825 198130642 NA ...

Show summary statistics now that the dataset has been cleansed

Load stargazer to improve appearance of summary statistics

library(stargazer)
stargazer(movie_df, type = "text", title = "Movie Data frame Summary Statistics")
## 
## Movie Data frame Summary Statistics
## =============================================================================================
## Statistic                   N        Mean         St. Dev.           Min            Max      
## ---------------------------------------------------------------------------------------------
## num_critic_for_reviews    4,870    138.010         120.254            1             813      
## duration                  4,904    107.089         25.279             7             511      
## director_facebook_likes   4,817    690.705        2,832.107           0            23,000    
## actor_3_facebook_likes    4,896    630.961        1,625.429           0            23,000    
## actor_1_facebook_likes    4,912   6,490.876      15,103.080           0           640,000    
## gross                     4,056 47,621,564.000 67,363,869.000        162        760,505,847  
## num_voted_users           4,919   82,618.200     138,285.800          5          1,689,764   
## cast_total_facebook_likes 4,919   9,574.646      18,159.990           0           656,730    
## num_user_for_reviews      4,898    267.604         372.839            1            5,060     
## budget                    4,434 39,288,218.000 208,557,941.000       218       12,215,500,000
## actor_2_facebook_likes    4,906   1,621.072       4,010.223           0           137,000    
## imdb_score                4,919     6.438           1.128           1.600          9.500     
## aspect_ratio              4,593     2.222           1.403           1.180          16.000    
## movie_facebook_likes      4,919   7,345.294      19,200.710           0           349,000    
## gross_profit              3,790 5,468,410.000  228,898,340.000 -12,213,298,588  523,505,847  
## ---------------------------------------------------------------------------------------------
# Stargazer only shows statistics for non categorical and character types

Data Visualization and Analysis

Load plotly visualization package

# Note: plotly package graphs are interactive
library(plotly) 
library(ggplot2)

Show the distribution of movie scores with a kernal density plot

# Kernel density plot
score_density <- density(movie_df$imdb_score) # Calculate kernel density estimate
plot(score_density, type = "n", main = "Kernel Density Estimate of Movie Scores")
polygon(score_density, col = "wheat")

The kernal density plot shows the density of each movie score. We can see that the most common score range is from approximately 6 to 6.5. This means that most films released will receive a score close to a 6. We need to utilize more visualization techniques to discover what causes films to reach exceptional scores to go above and beyond the average.

We can look to see which years tend to have higher quality movies based on average IMDb scores of movies over time.

# Create data frame with year and average score
yearly_score <- ddply(movie_df, c("title_year"), summarise,
                      mean(imdb_score, na.rm = TRUE))
colnames(yearly_score) <- c("title_year", "avg_score")

# Plotly line graph that shows average imdb scores across the years
plot_ly(yearly_score, x = yearly_score$title_year, y = yearly_score$avg_score, type = "scatter", mode = "lines") %>% 
  layout(title = "Year and Score") 

Up until the 1970s, only a few movies may be registered in the IMDb database for a given year, with many years only yielding a single movie. This leads to large variances between years up until the 1970s, such as how a single movie in 1920 is completely representative of the year’s extremely low average movie IMDb score of 4.8, only to surge the exisitng data point because a single movie performed better in 1925 and received a rating of 8.3. Regardless, the general trend appears to be a declining average IMDb score over time. This may not reflect the quality of large production films, but rather reduced barriers to entry have enabled a surge in indie films to enter the market and the fact that the IMDb database can also register these films with the rise of internet-based titles being published on platforms like YouTube.

Show number of movies released in each year

Create df with number year and movies in that year

movie_yearly_count <- movie_df %>% 
  group_by(title_year) %>%
  summarise(yearly_total = sum(length(title_year)))

Verify sum of number of movies in the data frame

sum(movie_yearly_count$yearly_total)
## [1] 4919

Plotly line graph that shows number of films released by year in the data set

plot_ly(movie_yearly_count, 
        x = movie_yearly_count$title_year, y = movie_yearly_count$yearly_total, type = "scatter", mode = "lines") %>% 
  layout(title = "Total films released by year")
# Plot skips some years along the x-axis, but this was for spacing; the interacting with the chart will show count for every year

This graph shows the total films released by year included in the data set. There are way more movies released than this throughout these years, but the data set only includes a few thousand movie observations. The older years may have biased average scores, because there are so few entries included in the data set for that year.

We can explore which actors have the highest average movie score overall.

# Find actors with highest average movie score 
actor_rating_df <- ddply(movie_df, c("actor_1_name"), summarise, # Pull out actor 1 and rating data
                      actor_m <- mean(imdb_score, na.rm = TRUE),  # Find the mean of actors and discount na values
                      number <- length(na.omit(imdb_score))) # Count number of movies, but omit na values
colnames(actor_rating_df) <- c("actor_name", "avg_score", "num_of_movies") # Rename headers
actor_avg <- actor_rating_df[which(actor_rating_df$num_of_movies >= 7),]# Actors in at least 7 movies to avoid skew
actor_avg <- actor_avg[which(actor_avg$avg_score >= 7.15),] # Show only highest scoring actors for readability

# Convert actor_name to a factor
actor_avg$actor_name <- factor(actor_avg$actor_name)

# Show highest to lowest average score
actor_avg$actor_name <- reorder(actor_avg$actor_name, actor_avg$avg_score)

# Visualization of highest scoring actors
ggplot(actor_avg, aes(x = avg_score, y = actor_name)) + # Sets x to average score and y to the name
  geom_point(aes(colour = actor_name)) + # Makes a point graph with different colors per actor
  theme(axis.text = element_text(size = 10)) + # Sets font size
  ggtitle("Actor Scores") + # Changes chart title
  xlab("Average Score") + # Changes x field title
  ylab("Actor Name")  # Changes y field table

We can see a visual representation of actors with the highest average IMDb ratings for the films they star in, with the top 3 being Leonardo DiCaprio, Tom Hanks, and Clint Eastwood. Predictions can be made for movie score utilizing these top scoring actors. This is implemented later on in this report.

We can explore which directors have the highest average movie score overall.

# Find actors with highest average movie score 
dir_rating_df <- ddply(movie_df, c("director_name"), summarise, # Pull out director name and rating data
                      dir_m <- mean(imdb_score, na.rm = TRUE),  # Find the mean of directors and discount na values
                      number <- length(na.omit(imdb_score))) # Count number of movies, but omit na values
colnames(dir_rating_df) <- c("director_name", "avg_score", "num_of_movies") # Rename headers
dir_avg <- dir_rating_df[which(dir_rating_df$num_of_movies >= 7),]# Direct  at least 7 movies to avoid skew
dir_avg <- dir_avg[which(dir_avg$avg_score >= 7.5),] # Show only highest scoring directors for readability
dir_avg <- dir_avg[which(dir_avg$director_name != ''),] # Exclude movies who have no director name 

# Convert actor_name to a factor
dir_avg$director_name <- factor(dir_avg$director_name)

# Show highest to lowest average score
dir_avg$director_name <- reorder(dir_avg$director_name, dir_avg$avg_score)

# Visualization of highest scoring actors
ggplot(dir_avg, aes(x = avg_score, y = director_name)) + # Sets x to average score and y to the name
  geom_point(aes(colour = director_name)) + # Makes a point graph with different colors per director
  theme(axis.text = element_text(size = 10, colour = 'black')) + # Sets font size
  ggtitle("Director Scores") +  # Changes chart title
  xlab("Average Score") + # Changes x field title
  ylab("Director Name") # Changes y field table

We can see a visual representation of directors with the highest average IMDb ratings for the films they star in, with the top 3 being Christopher Nolan, Quentin Tarantino, and James Cameron. Predictions can be made for movie score utilizing these top scoring directors. This is implemented later on in this report.

We can explore the average IMDb score for each content rating.

content_df <- ddply(movie_df, c("content_rating"), summarise, # New DF w/ rating, avg score and # of movies 
                    mean(imdb_score, na.rm = TRUE), # Creates column containing mean scores
                    length(na.omit(imdb_score))) # Omits scores with NA values
colnames(content_df) <- c("content_rating", "avg_score", "number_of_films") # Rename columns
content_df <- content_df[which(content_df$content_rating != ""),] # Eliminate films from analysis without rating

# Visualization of average score by content rating
ggplot(content_df, aes(x = content_rating, y = avg_score)) + # Bar graph with x as rating and y as score
  geom_bar(stat = "identity", width = .5, fill = 'deepskyblue3') + # Creates space between bars and changes color
  theme(axis.text = element_text(size = 8, colour = 'black', angle = 45)) + # Sets text size to 8, text color to black, and angles text for separation purposes
  ggtitle("Content Scores") + # Title of graph
  xlab("Content Rating") + # Title of x axis
  ylab("IMDB Score") # Title of y axis

Based on this plot, we can see the highest performing category of movies based on content rating is TV-MA (intended for a mature audience), with the lowest performing category being PG-13.This could possibly be the effect of restricted possibilities of movies that target a PG-13 audience not being as favored by mature audiences who watch both TV-MA movies and PG-13 movies.

We can show the spread of scores per content rating category with box plots.

# Create data frame that excludes movies with no content rating 
content_df2 <- movie_df[which(movie_df$content_rating != ''), ]

#Use plotly for great boxplot capabilities, note: Might need to install "plotly"
plot_ly(content_df2, x = content_df2$imdb_score, color = content_df2$content_rating, type = "box") %>%
  layout(title = "Movie Type and Score Range")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

These boxplots show the range film scores for each content rating. These outliers should not be removed because the values are legitimate. It is very possible for movie scores to achieve scores outside the boxplot range; either higher or lower.

Explore Average IMDB rating by number of faces in poster

# Create df with face numbers in poster and respective average score
face_df <- movie_df %>%
  group_by(face_count = facenumber_in_poster) %>%
  summarise(avg_score = mean(imdb_score)) %>%
  na.omit() # Omit NA values

# Bar Graph of data
ggplot(face_df, aes(x = face_count, y = avg_score)) +
  geom_bar(stat = "identity", width = .5, fill = "deepskyblue3") +
  xlab("Number of Faces") +
  ylab("Average Score") +
  ggtitle("Average Score by faces in poster") +
  theme(axis.text = element_text(size = 8, angle = 0))

The average score by faces in poster column graph shows whether having less faces in poster or more affects average score. It does not appear there is a strong relationship between the variables. While the poster with 43 faces has the highest average score, it is because there is only one movie with that the score so it is biased. The rest of the number of faces in posters seems to be fairly consistent throughout minus a few low-points.

Show 5 number summary for average score based on color using boxplot method

# Create Data Frame with average score and color
color_film <- movie_df %>%
  select(color, imdb_score) %>% na.omit()

ggplot(color_film, aes(x = color, y = imdb_score, group = color, fill = color)) +
  geom_boxplot() +
  xlab("Color Type") +
  ylab("Score") +
  ggtitle("Black and White vs Color Film Scores")

On average colored films tend to score a higher IMDb. We decided to keep the outliers because they are valid data.

Visualize how language affects movie score

language_df <- movie_df %>%
  group_by(language) %>%
  summarise(avg_score = mean(imdb_score)) %>% na.omit()

ggplot(language_df, aes(x = language, y = avg_score)) +
  geom_bar(stat = "Identity", width = .5, fill = "deepskyblue3") +
  theme(axis.text = element_text(size = 10, angle = 90)) + 
  xlab("Original Language") +
  ylab("Average Score") +
  ggtitle("Language and Score")

Based on the chart the language that scored the highest IMDb score is the Telungu language. This is not very representative because of the small number of movies that are actually in that language. It would not be a fair assumption to say that a language that has a small amount of movies and a high IMDb score would always get a higher score than a language that has more movies.

# Load in world map from map_data
world_countries <- map_data("world2")

Visualize how the country the film was released in affects average score

ScoreByCountry_df <- movie_df %>%
  group_by(country) %>%
  summarise(avg_score = mean(imdb_score)) %>% na.omit()

score_plot <- ggplot()

score_plot <- score_plot + geom_map(data = world_countries, map = world_countries,
                    aes(x = long, y = lat, map_id = region),
                    fill="#ffffff", color="#ffffff", size=0.15)
## Warning: Ignoring unknown aesthetics: x, y
score_plot <- score_plot + geom_map(data = ScoreByCountry_df, map = world_countries,
                    aes(fill = avg_score, map_id = country),
                    color="#ffffff", size=0.15)

score_plot <- score_plot + labs(fill = "Average IMDb Score",
                title = "Average IMDb Score by Country",
                x="", y="")

score_plot <- score_plot + theme(panel.border = element_blank())
#score_plot <- score_plot + theme(panel.background = element_blank())
score_plot <- score_plot + theme(axis.ticks = element_blank())
score_plot <- score_plot + theme(axis.text = element_blank())

score_plot

Based on the chart all countries have around the same score. There is some bias with the African nations who have a small amount of films compared to the other countries.

Investigate to see if number of movie Facebook likes and content rating are indicators of score

#movie_likes <- movie_df %>% select(movie_facebook_likes, imdb_score, content_rating)
plot_ly(movie_df, x = movie_df$movie_facebook_likes, y = movie_df$imdb_score,
        color =movie_df$content_rating , mode = "markers",text=paste('Content:',movie_df$content_rating))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

It appears that all the movie ratings tend to score around the same score. It also appears that the R movie rating is the most commonly rated movie.

Clear out environment for information management purposes

rm(list = ls()[!ls() %in% c("movie_df")])

Correlation

We chose to use Pearson Correlation, because the measures are not ordinal, and measures are linear

Pearson Correlation Table 1

Load Corrplot

library(corrplot)

Create data frame with only numeric types to see how numeric variables correlate to IMDb score

numeric_df <- select_if(movie_df, is.numeric) # Other option cor(movie_df[sapply(movie_df, is.numeric)])

# Creates a correlation table showing variables correlate w/ each other by number and text color 
num_correlation <- round(cor(numeric_df, use = "pairwise.complete.obs", method = "pearson"),1)
corrplot(num_correlation, method = "number", type = )

# Note: After running this chunk, you can also select hte num_correlation item from the global environment to view a generic table

Pearson Correlation Table 2, Visualized with Circles

# Creates a correlation table showing variables correlate w/ each other by color and  circle size
corrplot(num_correlation, method = "circle")

It appears that most variables have a very small correlation with each other. Some exceptions are the actor_1_facebook_likes and cast_total_facebook_likes as well as budget and gross_profit. This might be because if there are a lot of likes on the lead actor then the rest of the cast will be liked as well. Since there is a low correlation we may not be able to predict the scores very well.

Clear out global environment except for movie_df for information management purposes

rm(list = ls()[!ls() %in% c("movie_df")])

Multiple Linear Regression

Create data frame to be used for prediction

str(movie_df)
## 'data.frame':    4919 obs. of  29 variables:
##  $ color                    : Factor w/ 2 levels "0","1": 2 2 2 2 NA 2 2 2 2 2 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 NA 2714 1970 2163 3018 2941 ...
##  $ facenumber_in_poster     : Factor w/ 19 levels "0","1","2","3",..: 1 1 2 1 1 2 1 2 5 4 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 NA 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 NA 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 NA 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 NA 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : Factor w/ 91 levels "1916","1920",..: 84 82 90 87 NA 87 82 85 90 84 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
##  $ gross_profit             : num  523505847 9404152 -44925825 198130642 NA ...

Eliminate NA values

movie_df <- movie_df %>% na.omit()

Create factors of highest scoring actors and directors

movie_df$dicaprio <- ifelse(movie_df$actor_1_name == "Leonardo DiCaprio", 1, 0)
movie_df$hanks <- ifelse(movie_df$actor_1_name == "Tom Hanks", 1, 0)
movie_df$eastwood <- ifelse(movie_df$actor_1_name == "Clint Eastwood", 1, 0)
movie_df$hardy <- ifelse(movie_df$actor_1_name == "Tom Hardy", 1, 0)
movie_df$rickman <- ifelse(movie_df$actor_1_name == "Alan Rickman", 1, 0)
movie_df$cumberbatch <- ifelse(movie_df$actor_1_name == "Benedict Cumberbatch", 1, 0)
movie_df$bale <- ifelse(movie_df$actor_1_name == "Christian Bale", 1, 0)
movie_df$hoffman <- ifelse(movie_df$actor_1_name == "Philip Seymour Hoffman", 1, 0)
movie_df$driver <- ifelse(movie_df$actor_1_name == "Minni Driver", 1, 0)
movie_df$ford <- ifelse(movie_df$actor_1_name == "Harrison Ford", 1, 0)
movie_df$spacy <- ifelse(movie_df$actor_1_name == "Kevin Spacey", 1, 0)
movie_df$nolan <- ifelse(movie_df$director_name == "Christopher Nolan", 1, 0)
movie_df$tarantino <- ifelse(movie_df$director_name == "Quentin Tarantino", 1, 0)
movie_df$cameron <- ifelse(movie_df$director_name == "James Cameron", 1, 0)
movie_df$jackson <- ifelse(movie_df$director_name == "Peter Jackson", 1, 0)
movie_df$fincher <- ifelse(movie_df$director_name == "David Fincher", 1, 0)
movie_df$scorsese <- ifelse(movie_df$director_name == "Martin Scorsese", 1, 0)
movie_df$anderson <- ifelse(movie_df$director_name == "Wes Anderson", 1, 0)
movie_df$Greengrass <- ifelse(movie_df$director_name == "Paul Greengrass", 1, 0)

Build multiple linear regression model to see how the significance of variables on IMDB Score

Load MASS Package to fit model

library(MASS)

Regress IMDB score on all other variables and use step method to determine which model to use

# Fit the model
score_fit <- lm(imdb_score ~ + country + content_rating + color + title_year + language +
                  facenumber_in_poster + gross + dicaprio + hanks + eastwood + hardy +
                  rickman + cumberbatch + bale + hoffman + driver + ford + spacy + nolan +
                  tarantino + cameron + jackson + fincher + scorsese + anderson + Greengrass, 
                data = movie_df)

# Perform stepwise model seleciton 
score_step <- stepAIC(score_fit, direction = "both")
## Start:  AIC=-356.34
## imdb_score ~ +country + content_rating + color + title_year + 
##     language + facenumber_in_poster + gross + dicaprio + hanks + 
##     eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
##     driver + ford + spacy + nolan + tarantino + cameron + jackson + 
##     fincher + scorsese + anderson + Greengrass
## 
## 
## Step:  AIC=-356.34
## imdb_score ~ country + content_rating + color + title_year + 
##     language + facenumber_in_poster + gross + dicaprio + hanks + 
##     eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
##     ford + spacy + nolan + tarantino + cameron + jackson + fincher + 
##     scorsese + anderson + Greengrass
## 
##                        Df Sum of Sq    RSS     AIC
## - title_year           73   119.102 3100.0 -359.10
## - facenumber_in_poster 18    27.612 3008.5 -358.63
## - cameron               1     0.534 2981.4 -357.68
## <none>                              2980.9 -356.34
## - jackson               1     2.238 2983.2 -355.59
## - cumberbatch           1     2.416 2983.3 -355.37
## - hardy                 1     2.430 2983.3 -355.36
## - eastwood              1     3.270 2984.2 -354.33
## - ford                  1     3.464 2984.4 -354.09
## - rickman               1     3.927 2984.8 -353.52
## - dicaprio              1     3.968 2984.9 -353.47
## - Greengrass            1     4.028 2984.9 -353.40
## - bale                  1     5.271 2986.2 -351.88
## - nolan                 1     8.538 2989.5 -347.88
## - hoffman               1     9.999 2990.9 -346.09
## - scorsese              1    11.535 2992.4 -344.22
## - spacy                 1    11.666 2992.6 -344.05
## - fincher               1    12.452 2993.4 -343.10
## - anderson              1    12.532 2993.4 -343.00
## - tarantino             1    13.927 2994.8 -341.29
## - hanks                 1    15.530 2996.4 -339.34
## - color                 1    24.448 3005.4 -328.47
## - language             28   105.107 3086.0 -285.65
## - country              39   124.203 3105.1 -285.09
## - content_rating       11   171.818 3152.7 -173.46
## - gross                 1   264.966 3245.9  -47.00
## 
## Step:  AIC=-359.1
## imdb_score ~ country + content_rating + color + language + facenumber_in_poster + 
##     gross + dicaprio + hanks + eastwood + hardy + rickman + cumberbatch + 
##     bale + hoffman + ford + spacy + nolan + tarantino + cameron + 
##     jackson + fincher + scorsese + anderson + Greengrass
## 
##                        Df Sum of Sq    RSS     AIC
## - facenumber_in_poster 18    26.879 3126.9 -363.54
## - cameron               1     1.051 3101.1 -359.86
## <none>                              3100.0 -359.10
## - jackson               1     1.795 3101.8 -358.99
## - hardy                 1     2.134 3102.1 -358.59
## - cumberbatch           1     2.874 3102.9 -357.71
## - dicaprio              1     2.877 3102.9 -357.71
## + title_year           73   119.102 2980.9 -356.34
## - Greengrass            1     4.148 3104.2 -356.21
## - rickman               1     4.181 3104.2 -356.17
## - bale                  1     4.870 3104.9 -355.36
## - nolan                 1     7.985 3108.0 -351.70
## - eastwood              1     8.039 3108.1 -351.63
## - ford                  1     8.369 3108.4 -351.25
## - hoffman               1     9.059 3109.1 -350.43
## - spacy                 1    10.659 3110.7 -348.55
## - anderson              1    11.951 3112.0 -347.04
## - fincher               1    12.590 3112.6 -346.28
## - tarantino             1    14.155 3114.2 -344.45
## - scorsese              1    15.199 3115.2 -343.22
## - hanks                 1    15.391 3115.4 -343.00
## - color                 1    33.939 3134.0 -321.29
## - language             28   101.592 3201.6 -297.21
## - country              39   137.271 3237.3 -278.69
## - content_rating       11   189.467 3289.5 -164.22
## - gross                 1   283.669 3383.7  -40.99
## 
## Step:  AIC=-363.54
## imdb_score ~ country + content_rating + color + language + gross + 
##     dicaprio + hanks + eastwood + hardy + rickman + cumberbatch + 
##     bale + hoffman + ford + spacy + nolan + tarantino + cameron + 
##     jackson + fincher + scorsese + anderson + Greengrass
## 
##                        Df Sum of Sq    RSS     AIC
## - cameron               1     1.203 3128.1 -364.13
## <none>                              3126.9 -363.54
## - jackson               1     1.729 3128.6 -363.52
## - hardy                 1     2.042 3128.9 -363.15
## - dicaprio              1     2.938 3129.8 -362.11
## - cumberbatch           1     3.054 3129.9 -361.97
## - rickman               1     4.184 3131.1 -360.65
## - Greengrass            1     4.478 3131.4 -360.31
## - bale                  1     5.034 3131.9 -359.66
## + facenumber_in_poster 18    26.879 3100.0 -359.10
## + title_year           73   118.370 3008.5 -358.63
## - nolan                 1     8.428 3135.3 -355.70
## - ford                  1     8.501 3135.4 -355.61
## - eastwood              1     8.581 3135.5 -355.52
## - hoffman               1     9.672 3136.6 -354.25
## - spacy                 1     9.881 3136.8 -354.00
## - anderson              1    12.058 3139.0 -351.47
## - fincher               1    13.057 3140.0 -350.30
## - tarantino             1    14.276 3141.2 -348.89
## - hanks                 1    15.199 3142.1 -347.81
## - scorsese              1    15.511 3142.4 -347.45
## - color                 1    35.102 3162.0 -324.73
## - language             28   103.668 3230.6 -300.30
## - country              39   140.741 3267.6 -280.58
## - content_rating       11   192.543 3319.4 -167.08
## - gross                 1   287.558 3414.5  -43.90
## 
## Step:  AIC=-364.13
## imdb_score ~ country + content_rating + color + language + gross + 
##     dicaprio + hanks + eastwood + hardy + rickman + cumberbatch + 
##     bale + hoffman + ford + spacy + nolan + tarantino + jackson + 
##     fincher + scorsese + anderson + Greengrass
## 
##                        Df Sum of Sq    RSS     AIC
## - jackson               1     1.672 3129.8 -364.18
## <none>                              3128.1 -364.13
## - hardy                 1     2.027 3130.1 -363.77
## + cameron               1     1.203 3126.9 -363.54
## - cumberbatch           1     3.032 3131.1 -362.59
## - dicaprio              1     3.246 3131.3 -362.34
## - rickman               1     4.161 3132.3 -361.27
## - Greengrass            1     4.445 3132.5 -360.94
## - bale                  1     5.026 3133.1 -360.26
## + title_year           73   118.978 3009.1 -359.90
## + facenumber_in_poster 18    27.030 3101.1 -359.86
## - nolan                 1     8.270 3136.4 -356.48
## - ford                  1     8.387 3136.5 -356.34
## - eastwood              1     8.555 3136.7 -356.15
## - hoffman               1     9.626 3137.7 -354.90
## - spacy                 1     9.835 3137.9 -354.66
## - anderson              1    12.054 3140.2 -352.07
## - fincher               1    12.990 3141.1 -350.98
## - tarantino             1    14.166 3142.3 -349.61
## - hanks                 1    15.072 3143.2 -348.56
## - scorsese              1    15.295 3143.4 -348.30
## - color                 1    35.057 3163.2 -325.39
## - language             28   103.825 3231.9 -300.76
## - country              39   141.335 3269.4 -280.57
## - content_rating       11   194.427 3322.5 -165.68
## - gross                 1   300.436 3428.5  -30.85
## 
## Step:  AIC=-364.18
## imdb_score ~ country + content_rating + color + language + gross + 
##     dicaprio + hanks + eastwood + hardy + rickman + cumberbatch + 
##     bale + hoffman + ford + spacy + nolan + tarantino + fincher + 
##     scorsese + anderson + Greengrass
## 
##                        Df Sum of Sq    RSS     AIC
## <none>                              3129.8 -364.18
## + jackson               1     1.672 3128.1 -364.13
## - hardy                 1     2.013 3131.8 -363.83
## + cameron               1     1.146 3128.6 -363.52
## - cumberbatch           1     3.016 3132.8 -362.66
## - dicaprio              1     3.192 3133.0 -362.45
## - rickman               1     4.141 3133.9 -361.35
## - Greengrass            1     4.416 3134.2 -361.02
## - bale                  1     5.013 3134.8 -360.33
## + facenumber_in_poster 18    26.961 3102.8 -359.81
## + title_year           73   118.613 3011.2 -359.43
## - nolan                 1     8.157 3137.9 -356.66
## - ford                  1     8.279 3138.0 -356.52
## - eastwood              1     8.533 3138.3 -356.23
## - hoffman               1     9.593 3139.4 -354.99
## - spacy                 1     9.806 3139.6 -354.74
## - anderson              1    12.060 3141.8 -352.12
## - fincher               1    12.939 3142.7 -351.10
## - tarantino             1    14.139 3143.9 -349.70
## - hanks                 1    14.944 3144.7 -348.76
## - scorsese              1    15.306 3145.1 -348.34
## - color                 1    35.069 3164.8 -325.44
## - language             28   103.929 3233.7 -300.75
## - country              39   146.907 3276.7 -274.48
## - content_rating       11   194.540 3324.3 -165.71
## - gross                 1   310.242 3440.0  -20.63
# Show resultsd
score_step$Anova
## NULL

Since the original model had a lower AIC than the stepped models, we chose to use this model

Build final regression model

# Create multiple linear regression model
final_linear <- lm(imdb_score ~ country + content_rating + color + title_year + 
    language + facenumber_in_poster + gross + dicaprio + hanks + 
    eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
    driver + ford + spacy + nolan + tarantino + cameron + jackson + 
    fincher + scorsese + anderson + Greengrass, data = movie_df)

# Show the summary of the multiple linear regression model
summary(final_linear)
## 
## Call:
## lm(formula = imdb_score ~ country + content_rating + color + 
##     title_year + language + facenumber_in_poster + gross + dicaprio + 
##     hanks + eastwood + hardy + rickman + cumberbatch + bale + 
##     hoffman + driver + ford + spacy + nolan + tarantino + cameron + 
##     jackson + fincher + scorsese + anderson + Greengrass, data = movie_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5274 -0.4954  0.0497  0.6160  2.6719 
## 
## Coefficients: (6 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.329e+00  1.873e+00   3.913 9.27e-05 ***
## countryArgentina         9.373e-01  1.451e+00   0.646 0.518487    
## countryAruba            -1.814e+00  1.615e+00  -1.123 0.261410    
## countryAustralia        -9.442e-02  1.327e+00  -0.071 0.943269    
## countryBelgium           6.190e-01  1.613e+00   0.384 0.701205    
## countryBrazil            5.535e-01  1.535e+00   0.361 0.718493    
## countryCanada           -3.129e-01  1.323e+00  -0.236 0.813067    
## countryChile             9.237e-01  1.624e+00   0.569 0.569505    
## countryChina             7.194e-01  1.419e+00   0.507 0.612087    
## countryColombia          7.585e-01  1.638e+00   0.463 0.643421    
## countryCzech Republic   -1.013e-02  1.482e+00  -0.007 0.994548    
## countryDenmark           4.129e-01  1.382e+00   0.299 0.765148    
## countryFinland          -1.935e-01  1.636e+00  -0.118 0.905890    
## countryFrance            5.648e-02  1.323e+00   0.043 0.965937    
## countryGeorgia          -5.882e-01  1.640e+00  -0.359 0.719836    
## countryGermany          -3.591e-01  1.323e+00  -0.271 0.786045    
## countryGreece            5.631e-01  1.617e+00   0.348 0.727753    
## countryHong Kong         1.430e-01  1.385e+00   0.103 0.917781    
## countryHungary          -1.759e-01  1.615e+00  -0.109 0.913259    
## countryIceland           4.918e-01  1.614e+00   0.305 0.760562    
## countryIndia             2.898e-01  1.535e+00   0.189 0.850237    
## countryIndonesia        -3.502e-01  1.875e+00  -0.187 0.851875    
## countryIran              1.162e-01  1.656e+00   0.070 0.944059    
## countryIreland           6.137e-01  1.364e+00   0.450 0.652820    
## countryIsrael            8.487e-01  1.746e+00   0.486 0.626912    
## countryItaly             7.221e-01  1.375e+00   0.525 0.599457    
## countryJapan            -1.104e+00  1.367e+00  -0.807 0.419490    
## countryMexico            6.293e-01  1.400e+00   0.449 0.653118    
## countryNetherlands       3.862e-01  1.571e+00   0.246 0.805767    
## countryNew Line         -2.038e+00  1.614e+00  -1.263 0.206727    
## countryNew Zealand       6.024e-01  1.364e+00   0.442 0.658778    
## countryNorway            1.695e-01  1.549e+00   0.109 0.912873    
## countryOfficial site     3.154e-01  1.615e+00   0.195 0.845211    
## countryPeru             -1.487e+00  1.618e+00  -0.919 0.358114    
## countryPoland           -9.567e-01  1.614e+00  -0.593 0.553345    
## countryRomania          -1.717e+00  1.614e+00  -1.064 0.287485    
## countryRussia           -9.153e-01  1.623e+00  -0.564 0.572797    
## countrySouth Africa      7.301e-01  1.424e+00   0.513 0.608140    
## countrySouth Korea      -1.455e+00  1.427e+00  -1.020 0.307775    
## countrySpain             5.936e-01  1.337e+00   0.444 0.657102    
## countryTaiwan            7.615e-01  1.572e+00   0.484 0.628200    
## countryThailand         -6.401e-01  1.618e+00  -0.396 0.692317    
## countryUK                3.071e-01  1.319e+00   0.233 0.815908    
## countryUSA              -1.877e-01  1.318e+00  -0.142 0.886733    
## countryWest Germany      3.251e-01  1.667e+00   0.195 0.845358    
## content_ratingG          2.861e-01  5.605e-01   0.511 0.609699    
## content_ratingGP        -7.171e-03  1.266e+00  -0.006 0.995482    
## content_ratingM          1.777e+00  1.277e+00   1.392 0.164027    
## content_ratingNC-17      3.896e-01  6.823e-01   0.571 0.568028    
## content_ratingNot Rated  6.671e-01  5.896e-01   1.131 0.257967    
## content_ratingPassed     5.286e-01  1.431e+00   0.369 0.711779    
## content_ratingPG         2.355e-01  5.563e-01   0.423 0.672007    
## content_ratingPG-13      3.259e-01  5.559e-01   0.586 0.557736    
## content_ratingR          7.662e-01  5.556e-01   1.379 0.167992    
## content_ratingUnrated    7.639e-01  5.842e-01   1.308 0.191104    
## content_ratingX          3.307e-01  6.522e-01   0.507 0.612152    
## color1                  -5.013e-01  9.409e-02  -5.328 1.06e-07 ***
## title_year1929          -5.364e-01  1.907e+00  -0.281 0.778519    
## title_year1933           5.161e-01  1.385e+00   0.373 0.709455    
## title_year1935           1.377e+00  1.477e+00   0.932 0.351306    
## title_year1936           1.838e+00  1.372e+00   1.340 0.180316    
## title_year1937           9.155e-01  1.478e+00   0.619 0.535820    
## title_year1939           1.069e+00  1.372e+00   0.779 0.435891    
## title_year1940           1.166e+00  1.479e+00   0.788 0.430663    
## title_year1946           5.678e-01  1.194e+00   0.476 0.634415    
## title_year1947           5.578e-01  1.354e+00   0.412 0.680322    
## title_year1948           1.122e+00  1.479e+00   0.759 0.448022    
## title_year1950           4.715e-01  1.904e+00   0.248 0.804455    
## title_year1952          -8.973e-02  1.356e+00  -0.066 0.947227    
## title_year1953           5.829e-02  1.237e+00   0.047 0.962432    
## title_year1954           6.804e-01  1.203e+00   0.566 0.571741    
## title_year1957           1.442e+00  1.371e+00   1.052 0.293083    
## title_year1959           1.113e+00  1.355e+00   0.822 0.411332    
## title_year1960           1.184e+00  1.369e+00   0.864 0.387428    
## title_year1961           6.800e-01  1.385e+00   0.491 0.623475    
## title_year1962           1.379e+00  1.239e+00   1.113 0.265727    
## title_year1963           9.024e-01  1.270e+00   0.711 0.477350    
## title_year1964           1.051e+00  1.189e+00   0.884 0.376774    
## title_year1965           7.887e-01  1.113e+00   0.709 0.478631    
## title_year1966           2.027e+00  1.602e+00   1.265 0.205877    
## title_year1967           2.514e-01  1.479e+00   0.170 0.865035    
## title_year1968           9.886e-01  1.206e+00   0.820 0.412514    
## title_year1969          -7.901e-01  1.380e+00  -0.572 0.567098    
## title_year1970           8.578e-01  1.117e+00   0.768 0.442684    
## title_year1971           1.117e-01  1.202e+00   0.093 0.925967    
## title_year1972           8.350e-01  1.218e+00   0.686 0.492969    
## title_year1973           3.900e-01  1.090e+00   0.358 0.720669    
## title_year1974           6.429e-01  1.075e+00   0.598 0.549741    
## title_year1975           1.246e+00  1.141e+00   1.092 0.274881    
## title_year1976           9.334e-01  1.201e+00   0.777 0.437231    
## title_year1977           3.704e-01  1.076e+00   0.344 0.730717    
## title_year1978          -8.631e-03  1.076e+00  -0.008 0.993599    
## title_year1979           3.250e-01  1.077e+00   0.302 0.762879    
## title_year1980           7.586e-02  1.039e+00   0.073 0.941797    
## title_year1981          -2.230e-01  1.033e+00  -0.216 0.829124    
## title_year1982          -1.604e-01  1.037e+00  -0.155 0.877057    
## title_year1983           7.472e-02  1.039e+00   0.072 0.942673    
## title_year1984           1.531e-02  1.028e+00   0.015 0.988115    
## title_year1985           8.615e-03  1.035e+00   0.008 0.993356    
## title_year1986          -4.902e-01  1.023e+00  -0.479 0.631770    
## title_year1987          -4.195e-01  1.021e+00  -0.411 0.681065    
## title_year1988          -3.083e-01  1.021e+00  -0.302 0.762689    
## title_year1989          -9.991e-03  1.020e+00  -0.010 0.992181    
## title_year1990          -3.821e-02  1.023e+00  -0.037 0.970191    
## title_year1991          -3.431e-01  1.020e+00  -0.336 0.736635    
## title_year1992          -1.380e-01  1.019e+00  -0.135 0.892302    
## title_year1993          -1.108e-01  1.015e+00  -0.109 0.913079    
## title_year1994          -2.272e-01  1.014e+00  -0.224 0.822799    
## title_year1995          -3.471e-01  1.012e+00  -0.343 0.731743    
## title_year1996          -3.868e-01  1.011e+00  -0.383 0.701945    
## title_year1997          -4.127e-01  1.011e+00  -0.408 0.683055    
## title_year1998          -3.492e-01  1.009e+00  -0.346 0.729221    
## title_year1999          -4.136e-01  1.009e+00  -0.410 0.681814    
## title_year2000          -5.440e-01  1.008e+00  -0.540 0.589387    
## title_year2001          -5.101e-01  1.008e+00  -0.506 0.612984    
## title_year2002          -5.422e-01  1.008e+00  -0.538 0.590776    
## title_year2003          -6.060e-01  1.008e+00  -0.601 0.547590    
## title_year2004          -3.764e-01  1.007e+00  -0.374 0.708599    
## title_year2005          -3.496e-01  1.008e+00  -0.347 0.728828    
## title_year2006          -4.260e-01  1.007e+00  -0.423 0.672440    
## title_year2007          -2.737e-01  1.008e+00  -0.271 0.786070    
## title_year2008          -4.523e-01  1.007e+00  -0.449 0.653411    
## title_year2009          -4.545e-01  1.007e+00  -0.451 0.651876    
## title_year2010          -4.426e-01  1.008e+00  -0.439 0.660724    
## title_year2011          -4.526e-01  1.008e+00  -0.449 0.653483    
## title_year2012          -4.039e-01  1.008e+00  -0.401 0.688630    
## title_year2013          -2.808e-01  1.009e+00  -0.278 0.780738    
## title_year2014          -3.879e-01  1.009e+00  -0.384 0.700711    
## title_year2015          -4.010e-01  1.010e+00  -0.397 0.691268    
## title_year2016          -3.577e-01  1.014e+00  -0.353 0.724349    
## languageArabic           3.497e-01  1.150e+00   0.304 0.761112    
## languageAramaic          6.829e-02  1.144e+00   0.060 0.952421    
## languageBosnian         -2.655e+00  1.145e+00  -2.319 0.020442 *  
## languageCantonese       -5.347e-02  8.634e-01  -0.062 0.950619    
## languageCzech            2.309e-01  1.328e+00   0.174 0.862022    
## languageDanish           3.159e-01  9.511e-01   0.332 0.739803    
## languageDari             8.389e-01  1.145e+00   0.733 0.463624    
## languageDutch                   NA         NA      NA       NA    
## languageEnglish         -6.749e-01  6.639e-01  -1.017 0.309428    
## languageFilipino        -9.022e-01  1.146e+00  -0.787 0.431175    
## languageFrench           3.399e-01  6.919e-01   0.491 0.623233    
## languageGerman           6.542e-01  7.513e-01   0.871 0.383950    
## languageHebrew                  NA         NA      NA       NA    
## languageHindi                   NA         NA      NA       NA    
## languageHungarian        3.107e-02  1.475e+00   0.021 0.983199    
## languageIndonesian       7.912e-01  1.159e+00   0.683 0.494705    
## languageItalian         -1.211e+00  8.404e-01  -1.441 0.149624    
## languageJapanese         1.545e+00  8.007e-01   1.930 0.053701 .  
## languageKazakh          -1.301e+00  1.147e+00  -1.134 0.256923    
## languageKorean           2.010e+00  9.536e-01   2.108 0.035139 *  
## languageMandarin        -7.189e-01  8.557e-01  -0.840 0.400936    
## languageMaya             5.971e-01  1.144e+00   0.522 0.601845    
## languageMongolian        8.705e-01  1.485e+00   0.586 0.557752    
## languageNone             1.976e+00  1.144e+00   1.726 0.084373 .  
## languageNorwegian               NA         NA      NA       NA    
## languagePersian          1.256e+00  1.320e+00   0.952 0.341375    
## languagePortuguese              NA         NA      NA       NA    
## languageRomanian         2.391e+00  1.487e+00   1.608 0.107883    
## languageRussian          1.916e-01  1.485e+00   0.129 0.897343    
## languageSpanish         -5.043e-01  7.224e-01  -0.698 0.485193    
## languageThai             1.194e-01  1.268e+00   0.094 0.925025    
## languageVietnamese       4.734e-01  1.145e+00   0.414 0.679227    
## languageZulu            -2.728e-01  1.141e+00  -0.239 0.811075    
## facenumber_in_poster1    8.938e-03  3.983e-02   0.224 0.822459    
## facenumber_in_poster2   -5.615e-02  4.866e-02  -1.154 0.248672    
## facenumber_in_poster3   -6.115e-02  6.136e-02  -0.997 0.319037    
## facenumber_in_poster4   -1.397e-01  7.994e-02  -1.748 0.080556 .  
## facenumber_in_poster5   -1.906e-01  1.125e-01  -1.695 0.090160 .  
## facenumber_in_poster6   -1.151e-01  1.289e-01  -0.894 0.371581    
## facenumber_in_poster7   -1.545e-01  1.787e-01  -0.864 0.387467    
## facenumber_in_poster8   -1.706e-01  1.730e-01  -0.986 0.324047    
## facenumber_in_poster9   -2.779e-01  2.958e-01  -0.939 0.347589    
## facenumber_in_poster10  -4.249e-01  3.818e-01  -1.113 0.265918    
## facenumber_in_poster11  -1.663e+00  4.176e-01  -3.983 6.95e-05 ***
## facenumber_in_poster12  -2.029e-01  5.381e-01  -0.377 0.706203    
## facenumber_in_poster13  -3.049e-01  9.328e-01  -0.327 0.743810    
## facenumber_in_poster14   1.053e-01  1.008e+00   0.104 0.916808    
## facenumber_in_poster15   1.779e-01  4.664e-01   0.381 0.702859    
## facenumber_in_poster19  -1.049e+00  9.319e-01  -1.126 0.260432    
## facenumber_in_poster31   7.011e-01  9.317e-01   0.752 0.451805    
## facenumber_in_poster43   1.722e+00  9.314e-01   1.849 0.064605 .  
## gross                    4.385e-09  2.500e-10  17.540  < 2e-16 ***
## dicaprio                 4.760e-01  2.218e-01   2.146 0.031911 *  
## hanks                    8.224e-01  1.937e-01   4.246 2.23e-05 ***
## eastwood                 5.695e-01  2.923e-01   1.949 0.051426 .  
## hardy                    5.444e-01  3.241e-01   1.680 0.093133 .  
## rickman                  7.105e-01  3.328e-01   2.135 0.032817 *  
## cumberbatch              5.934e-01  3.543e-01   1.675 0.094046 .  
## bale                     5.215e-01  2.108e-01   2.474 0.013418 *  
## hoffman                  7.741e-01  2.272e-01   3.407 0.000664 ***
## driver                          NA         NA      NA       NA    
## ford                     3.850e-01  1.919e-01   2.006 0.044978 *  
## spacy                    7.366e-01  2.001e-01   3.680 0.000236 ***
## nolan                    1.086e+00  3.449e-01   3.149 0.001655 ** 
## tarantino                1.339e+00  3.330e-01   4.021 5.91e-05 ***
## cameron                  2.851e-01  3.622e-01   0.787 0.431173    
## jackson                  6.078e-01  3.770e-01   1.612 0.107022    
## fincher                  1.124e+00  2.956e-01   3.802 0.000146 ***
## scorsese                 9.034e-01  2.469e-01   3.660 0.000256 ***
## anderson                 1.347e+00  3.532e-01   3.814 0.000139 ***
## Greengrass               7.659e-01  3.542e-01   2.162 0.030649 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9281 on 3461 degrees of freedom
## Multiple R-squared:  0.271,  Adjusted R-squared:  0.2302 
## F-statistic: 6.633 on 194 and 3461 DF,  p-value: < 2.2e-16

Remove MASS package from the environment as it affects the dplyr package

detach("package:MASS", unload = F)

Predictive Modeling

Load caret Package

library(caret)

Create 10-Fold validation configuration of control

fit_control <- trainControl(method = "cv", number = 10)

Train Multiple Linear Regression model to predict imdb_score

# Set seed to obtain same results each time
set.seed(1234)

# Train linear model
multi_fit <- train(imdb_score ~ country + content_rating + color + title_year + 
    language + facenumber_in_poster + gross + dicaprio + hanks + 
    eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
    driver + ford + spacy + nolan + tarantino + cameron + jackson + 
    fincher + scorsese + anderson + Greengrass, 
                   data = movie_df,
                   trControl = fit_control,
                   method = "lm")

# Print results
print(multi_fit)
## Linear Regression 
## 
## 3656 samples
##   26 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3291, 3291, 3291, 3291, 3290, 3291, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.9511951  0.1993298  0.7377693
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Output the summary of the linear model
summary(multi_fit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5274 -0.4954  0.0497  0.6160  2.6719 
## 
## Coefficients: (65 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.354e+00  1.434e+00   5.128 3.09e-07 ***
## countryAfghanistan            -3.251e-01  1.667e+00  -0.195 0.845358    
## countryArgentina               6.122e-01  1.188e+00   0.515 0.606342    
## countryAruba                  -2.139e+00  1.382e+00  -1.548 0.121812    
## countryAustralia              -4.195e-01  1.030e+00  -0.407 0.683822    
## countryBahamas                        NA         NA      NA       NA    
## countryBelgium                 2.939e-01  1.381e+00   0.213 0.831464    
## countryBrazil                  5.012e-01  1.444e+00   0.347 0.728499    
## countryBulgaria                       NA         NA      NA       NA    
## countryCambodia                       NA         NA      NA       NA    
## countryCameroon                       NA         NA      NA       NA    
## countryCanada                 -6.380e-01  1.026e+00  -0.622 0.534188    
## countryChile                   5.986e-01  1.393e+00   0.430 0.667360    
## countryChina                   3.943e-01  1.147e+00   0.344 0.731050    
## countryColombia                4.334e-01  1.410e+00   0.307 0.758638    
## `countryCzech Republic`       -3.352e-01  1.224e+00  -0.274 0.784246    
## countryDenmark                 8.783e-02  1.102e+00   0.080 0.936487    
## `countryDominican Republic`           NA         NA      NA       NA    
## countryEgypt                          NA         NA      NA       NA    
## countryFinland                -5.185e-01  1.407e+00  -0.369 0.712519    
## countryFrance                 -2.686e-01  1.026e+00  -0.262 0.793430    
## countryGeorgia                -9.133e-01  1.412e+00  -0.647 0.517690    
## countryGermany                -6.842e-01  1.014e+00  -0.675 0.499680    
## countryGreece                  2.380e-01  1.385e+00   0.172 0.863565    
## `countryHong Kong`            -1.821e-01  1.104e+00  -0.165 0.868940    
## countryHungary                -5.010e-01  1.383e+00  -0.362 0.717132    
## countryIceland                 1.668e-01  1.381e+00   0.121 0.903923    
## countryIndia                   2.375e-01  1.444e+00   0.165 0.869333    
## countryIndonesia              -6.753e-01  1.678e+00  -0.403 0.687339    
## countryIran                   -2.088e-01  1.431e+00  -0.146 0.883982    
## countryIreland                 2.886e-01  1.079e+00   0.267 0.789103    
## countryIsrael                  7.965e-01  1.666e+00   0.478 0.632699    
## countryItaly                   3.970e-01  1.091e+00   0.364 0.715960    
## countryJapan                  -1.429e+00  1.083e+00  -1.319 0.187301    
## countryKenya                          NA         NA      NA       NA    
## countryKyrgyzstan                     NA         NA      NA       NA    
## countryLibya                          NA         NA      NA       NA    
## countryMexico                  3.042e-01  1.125e+00   0.270 0.786837    
## countryNetherlands             3.339e-01  1.482e+00   0.225 0.821782    
## `countryNew Line`             -2.363e+00  1.380e+00  -1.712 0.086965 .  
## `countryNew Zealand`           2.773e-01  1.079e+00   0.257 0.797133    
## countryNigeria                        NA         NA      NA       NA    
## countryNorway                  1.172e-01  1.459e+00   0.080 0.935950    
## `countryOfficial site`        -9.720e-03  1.382e+00  -0.007 0.994390    
## countryPakistan                       NA         NA      NA       NA    
## countryPanama                         NA         NA      NA       NA    
## countryPeru                   -1.812e+00  1.386e+00  -1.308 0.191069    
## countryPhilippines                    NA         NA      NA       NA    
## countryPoland                 -1.282e+00  1.380e+00  -0.929 0.353107    
## countryRomania                -2.042e+00  1.381e+00  -1.478 0.139413    
## countryRussia                 -1.240e+00  1.392e+00  -0.891 0.372796    
## countrySlovakia                       NA         NA      NA       NA    
## countrySlovenia                       NA         NA      NA       NA    
## `countrySouth Africa`          4.050e-01  1.154e+00   0.351 0.725769    
## `countrySouth Korea`          -1.780e+00  1.155e+00  -1.542 0.123199    
## `countrySoviet Union`                 NA         NA      NA       NA    
## countrySpain                   2.685e-01  1.044e+00   0.257 0.797095    
## countrySweden                         NA         NA      NA       NA    
## countrySwitzerland                    NA         NA      NA       NA    
## countryTaiwan                  4.364e-01  1.333e+00   0.327 0.743398    
## countryThailand               -9.652e-01  1.385e+00  -0.697 0.486050    
## countryTurkey                         NA         NA      NA       NA    
## countryUK                     -1.797e-02  1.020e+00  -0.018 0.985949    
## `countryUnited Arab Emirates`         NA         NA      NA       NA    
## countryUSA                    -5.128e-01  1.020e+00  -0.503 0.615080    
## `countryWest Germany`                 NA         NA      NA       NA    
## content_ratingApproved        -3.307e-01  6.522e-01  -0.507 0.612152    
## content_ratingG               -4.456e-02  3.428e-01  -0.130 0.896577    
## content_ratingGP              -3.379e-01  1.187e+00  -0.285 0.775939    
## content_ratingM                1.447e+00  1.196e+00   1.209 0.226583    
## `content_ratingNC-17`          5.890e-02  5.153e-01   0.114 0.909010    
## `content_ratingNot Rated`      3.364e-01  3.779e-01   0.890 0.373410    
## content_ratingPassed           1.979e-01  1.360e+00   0.146 0.884286    
## content_ratingPG              -9.516e-02  3.295e-01  -0.289 0.772731    
## `content_ratingPG-13`         -4.809e-03  3.293e-01  -0.015 0.988347    
## content_ratingR                4.355e-01  3.281e-01   1.327 0.184509    
## `content_ratingTV-14`                 NA         NA      NA       NA    
## `content_ratingTV-G`                  NA         NA      NA       NA    
## `content_ratingTV-MA`                 NA         NA      NA       NA    
## `content_ratingTV-PG`                 NA         NA      NA       NA    
## `content_ratingTV-Y`                  NA         NA      NA       NA    
## `content_ratingTV-Y7`                 NA         NA      NA       NA    
## content_ratingUnrated          4.332e-01  3.973e-01   1.090 0.275703    
## content_ratingX                       NA         NA      NA       NA    
## color1                        -5.013e-01  9.409e-02  -5.328 1.06e-07 ***
## title_year1920                        NA         NA      NA       NA    
## title_year1925                        NA         NA      NA       NA    
## title_year1927                 3.577e-01  1.014e+00   0.353 0.724349    
## title_year1929                -1.787e-01  1.626e+00  -0.110 0.912508    
## title_year1930                        NA         NA      NA       NA    
## title_year1932                        NA         NA      NA       NA    
## title_year1933                 8.738e-01  9.692e-01   0.902 0.367362    
## title_year1934                        NA         NA      NA       NA    
## title_year1935                 1.735e+00  1.092e+00   1.589 0.112184    
## title_year1936                 2.196e+00  9.490e-01   2.314 0.020741 *  
## title_year1937                 1.273e+00  1.090e+00   1.168 0.242858    
## title_year1938                        NA         NA      NA       NA    
## title_year1939                 1.427e+00  9.437e-01   1.512 0.130665    
## title_year1940                 1.523e+00  1.089e+00   1.399 0.162025    
## title_year1941                        NA         NA      NA       NA    
## title_year1942                        NA         NA      NA       NA    
## title_year1943                        NA         NA      NA       NA    
## title_year1944                        NA         NA      NA       NA    
## title_year1945                        NA         NA      NA       NA    
## title_year1946                 9.255e-01  6.893e-01   1.343 0.179461    
## title_year1947                 9.155e-01  9.612e-01   0.952 0.340936    
## title_year1948                 1.480e+00  1.089e+00   1.359 0.174366    
## title_year1949                        NA         NA      NA       NA    
## title_year1950                 8.292e-01  1.620e+00   0.512 0.608758    
## title_year1951                        NA         NA      NA       NA    
## title_year1952                 2.680e-01  9.565e-01   0.280 0.779401    
## title_year1953                 4.160e-01  7.368e-01   0.565 0.572403    
## title_year1954                 1.038e+00  7.092e-01   1.464 0.143361    
## title_year1955                        NA         NA      NA       NA    
## title_year1956                        NA         NA      NA       NA    
## title_year1957                 1.799e+00  9.398e-01   1.915 0.055618 .  
## title_year1958                        NA         NA      NA       NA    
## title_year1959                 1.471e+00  9.614e-01   1.530 0.126082    
## title_year1960                 1.541e+00  9.427e-01   1.635 0.102148    
## title_year1961                 1.038e+00  9.633e-01   1.077 0.281484    
## title_year1962                 1.737e+00  7.324e-01   2.372 0.017756 *  
## title_year1963                 1.260e+00  7.818e-01   1.612 0.107082    
## title_year1964                 1.409e+00  6.440e-01   2.188 0.028757 *  
## title_year1965                 1.146e+00  4.916e-01   2.332 0.019775 *  
## title_year1966                 2.384e+00  1.245e+00   1.915 0.055596 .  
## title_year1967                 6.091e-01  1.089e+00   0.559 0.576038    
## title_year1968                 1.346e+00  6.790e-01   1.983 0.047471 *  
## title_year1969                -4.324e-01  9.519e-01  -0.454 0.649664    
## title_year1970                 1.215e+00  5.009e-01   2.427 0.015281 *  
## title_year1971                 4.694e-01  6.712e-01   0.699 0.484388    
## title_year1972                 1.193e+00  6.984e-01   1.708 0.087767 .  
## title_year1973                 7.476e-01  4.394e-01   1.701 0.088963 .  
## title_year1974                 1.001e+00  4.010e-01   2.495 0.012632 *  
## title_year1975                 1.603e+00  5.517e-01   2.906 0.003682 ** 
## title_year1976                 1.291e+00  6.699e-01   1.927 0.054029 .  
## title_year1977                 7.280e-01  4.031e-01   1.806 0.070998 .  
## title_year1978                 3.491e-01  4.008e-01   0.871 0.383891    
## title_year1979                 6.827e-01  4.050e-01   1.686 0.091977 .  
## title_year1980                 4.335e-01  2.902e-01   1.494 0.135239    
## title_year1981                 1.346e-01  2.687e-01   0.501 0.616383    
## title_year1982                 1.973e-01  2.810e-01   0.702 0.482738    
## title_year1983                 4.324e-01  2.891e-01   1.495 0.134878    
## title_year1984                 3.730e-01  2.471e-01   1.510 0.131214    
## title_year1985                 3.663e-01  2.732e-01   1.341 0.180106    
## title_year1986                -1.326e-01  2.271e-01  -0.584 0.559431    
## title_year1987                -6.187e-02  2.146e-01  -0.288 0.773189    
## title_year1988                 4.937e-02  2.162e-01   0.228 0.819404    
## title_year1989                 3.477e-01  2.099e-01   1.657 0.097684 .  
## title_year1990                 3.195e-01  2.240e-01   1.426 0.153861    
## title_year1991                 1.456e-02  2.136e-01   0.068 0.945646    
## title_year1992                 2.197e-01  2.075e-01   1.059 0.289720    
## title_year1993                 2.468e-01  1.903e-01   1.297 0.194807    
## title_year1994                 1.305e-01  1.842e-01   0.709 0.478659    
## title_year1995                 1.058e-02  1.726e-01   0.061 0.951097    
## title_year1996                -2.917e-02  1.610e-01  -0.181 0.856272    
## title_year1997                -5.499e-02  1.590e-01  -0.346 0.729435    
## title_year1998                 8.451e-03  1.559e-01   0.054 0.956778    
## title_year1999                -5.593e-02  1.486e-01  -0.376 0.706658    
## title_year2000                -1.863e-01  1.484e-01  -1.256 0.209333    
## title_year2001                -1.524e-01  1.465e-01  -1.040 0.298363    
## title_year2002                -1.845e-01  1.453e-01  -1.270 0.204175    
## title_year2003                -2.484e-01  1.502e-01  -1.653 0.098348 .  
## title_year2004                -1.868e-02  1.467e-01  -0.127 0.898704    
## title_year2005                 8.089e-03  1.463e-01   0.055 0.955899    
## title_year2006                -6.828e-02  1.461e-01  -0.467 0.640252    
## title_year2007                 8.395e-02  1.499e-01   0.560 0.575481    
## title_year2008                -9.465e-02  1.457e-01  -0.650 0.515944    
## title_year2009                -9.677e-02  1.465e-01  -0.661 0.508916    
## title_year2010                -8.496e-02  1.473e-01  -0.577 0.564008    
## title_year2011                -9.490e-02  1.478e-01  -0.642 0.520872    
## title_year2012                -4.618e-02  1.490e-01  -0.310 0.756686    
## title_year2013                 7.693e-02  1.480e-01   0.520 0.603209    
## title_year2014                -3.024e-02  1.506e-01  -0.201 0.840877    
## title_year2015                -4.335e-02  1.547e-01  -0.280 0.779384    
## title_year2016                        NA         NA      NA       NA    
## languageAboriginal             2.728e-01  1.141e+00   0.239 0.811075    
## languageArabic                 6.225e-01  1.324e+00   0.470 0.638383    
## languageAramaic                3.411e-01  1.318e+00   0.259 0.795844    
## languageBosnian               -2.382e+00  1.318e+00  -1.807 0.070913 .  
## languageCantonese              2.193e-01  1.083e+00   0.203 0.839517    
## languageChinese                       NA         NA      NA       NA    
## languageCzech                  5.037e-01  1.481e+00   0.340 0.733852    
## languageDanish                 5.887e-01  1.155e+00   0.510 0.610421    
## languageDari                   1.112e+00  1.319e+00   0.843 0.399442    
## languageDutch                         NA         NA      NA       NA    
## languageDzongkha                      NA         NA      NA       NA    
## languageEnglish               -4.021e-01  9.328e-01  -0.431 0.666425    
## languageFilipino              -6.294e-01  1.318e+00  -0.478 0.632954    
## languageFrench                 6.127e-01  9.528e-01   0.643 0.520196    
## languageGerman                 9.270e-01  9.963e-01   0.930 0.352193    
## languageGreek                         NA         NA      NA       NA    
## languageHebrew                        NA         NA      NA       NA    
## languageHindi                         NA         NA      NA       NA    
## languageHungarian              3.039e-01  1.612e+00   0.189 0.850464    
## languageIcelandic                     NA         NA      NA       NA    
## languageIndonesian             1.064e+00  1.330e+00   0.800 0.423772    
## languageItalian               -9.384e-01  1.065e+00  -0.881 0.378301    
## languageJapanese               1.818e+00  1.035e+00   1.757 0.078967 .  
## languageKannada                       NA         NA      NA       NA    
## languageKazakh                -1.028e+00  1.319e+00  -0.779 0.435760    
## languageKorean                 2.283e+00  1.156e+00   1.975 0.048360 *  
## languageMandarin              -4.460e-01  1.077e+00  -0.414 0.678799    
## languageMaya                   8.699e-01  1.318e+00   0.660 0.509299    
## languageMongolian              1.143e+00  1.623e+00   0.704 0.481313    
## languageNone                   2.248e+00  1.319e+00   1.705 0.088335 .  
## languageNorwegian                     NA         NA      NA       NA    
## languagePanjabi                       NA         NA      NA       NA    
## languagePersian                1.529e+00  1.473e+00   1.038 0.299522    
## languagePolish                        NA         NA      NA       NA    
## languagePortuguese                    NA         NA      NA       NA    
## languageRomanian               2.664e+00  1.625e+00   1.639 0.101299    
## languageRussian                4.644e-01  1.623e+00   0.286 0.774819    
## languageSlovenian                     NA         NA      NA       NA    
## languageSpanish               -2.315e-01  9.750e-01  -0.237 0.812357    
## languageSwahili                       NA         NA      NA       NA    
## languageSwedish                       NA         NA      NA       NA    
## languageTamil                         NA         NA      NA       NA    
## languageTelugu                        NA         NA      NA       NA    
## languageThai                   3.922e-01  1.427e+00   0.275 0.783483    
## languageUrdu                          NA         NA      NA       NA    
## languageVietnamese             7.462e-01  1.318e+00   0.566 0.571426    
## languageZulu                          NA         NA      NA       NA    
## facenumber_in_poster1          8.938e-03  3.983e-02   0.224 0.822459    
## facenumber_in_poster2         -5.615e-02  4.866e-02  -1.154 0.248672    
## facenumber_in_poster3         -6.115e-02  6.136e-02  -0.997 0.319037    
## facenumber_in_poster4         -1.397e-01  7.994e-02  -1.748 0.080556 .  
## facenumber_in_poster5         -1.906e-01  1.125e-01  -1.695 0.090160 .  
## facenumber_in_poster6         -1.151e-01  1.289e-01  -0.894 0.371581    
## facenumber_in_poster7         -1.545e-01  1.787e-01  -0.864 0.387467    
## facenumber_in_poster8         -1.706e-01  1.730e-01  -0.986 0.324047    
## facenumber_in_poster9         -2.779e-01  2.958e-01  -0.939 0.347589    
## facenumber_in_poster10        -4.249e-01  3.818e-01  -1.113 0.265918    
## facenumber_in_poster11        -1.663e+00  4.176e-01  -3.983 6.95e-05 ***
## facenumber_in_poster12        -2.029e-01  5.381e-01  -0.377 0.706203    
## facenumber_in_poster13        -3.049e-01  9.328e-01  -0.327 0.743810    
## facenumber_in_poster14         1.053e-01  1.008e+00   0.104 0.916808    
## facenumber_in_poster15         1.779e-01  4.664e-01   0.381 0.702859    
## facenumber_in_poster19        -1.049e+00  9.319e-01  -1.126 0.260432    
## facenumber_in_poster31         7.011e-01  9.317e-01   0.752 0.451805    
## facenumber_in_poster43         1.722e+00  9.314e-01   1.849 0.064605 .  
## gross                          4.385e-09  2.500e-10  17.540  < 2e-16 ***
## dicaprio                       4.760e-01  2.218e-01   2.146 0.031911 *  
## hanks                          8.224e-01  1.937e-01   4.246 2.23e-05 ***
## eastwood                       5.695e-01  2.923e-01   1.949 0.051426 .  
## hardy                          5.444e-01  3.241e-01   1.680 0.093133 .  
## rickman                        7.105e-01  3.328e-01   2.135 0.032817 *  
## cumberbatch                    5.934e-01  3.543e-01   1.675 0.094046 .  
## bale                           5.215e-01  2.108e-01   2.474 0.013418 *  
## hoffman                        7.741e-01  2.272e-01   3.407 0.000664 ***
## driver                                NA         NA      NA       NA    
## ford                           3.850e-01  1.919e-01   2.006 0.044978 *  
## spacy                          7.366e-01  2.001e-01   3.680 0.000236 ***
## nolan                          1.086e+00  3.449e-01   3.149 0.001655 ** 
## tarantino                      1.339e+00  3.330e-01   4.021 5.91e-05 ***
## cameron                        2.851e-01  3.622e-01   0.787 0.431173    
## jackson                        6.078e-01  3.770e-01   1.612 0.107022    
## fincher                        1.124e+00  2.956e-01   3.802 0.000146 ***
## scorsese                       9.034e-01  2.469e-01   3.660 0.000256 ***
## anderson                       1.347e+00  3.532e-01   3.814 0.000139 ***
## Greengrass                     7.659e-01  3.542e-01   2.162 0.030649 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9281 on 3461 degrees of freedom
## Multiple R-squared:  0.271,  Adjusted R-squared:  0.2302 
## F-statistic: 6.633 on 194 and 3461 DF,  p-value: < 2.2e-16

Train Gradient Boosted Machine Method to predict imdb_score

set.seed(1234)
gbm_fit <- train(imdb_score ~ country + content_rating + color + title_year + 
    language + facenumber_in_poster + gross + dicaprio + hanks + 
    eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
    driver + ford + spacy + nolan + tarantino + cameron + jackson + 
    fincher + scorsese + anderson + Greengrass,
                 data = movie_df,
                 trControl = fit_control,
                 method = "gbm",
                 verbose = F)

print(gbm_fit)
## Stochastic Gradient Boosting 
## 
## 3656 samples
##   26 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3291, 3291, 3291, 3291, 3290, 3291, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   1                   50      0.9815368  0.1611215  0.7606611
##   1                  100      0.9664423  0.1741769  0.7485170
##   1                  150      0.9580304  0.1845483  0.7411589
##   2                   50      0.9656447  0.1743920  0.7473884
##   2                  100      0.9561040  0.1851505  0.7390235
##   2                  150      0.9522526  0.1906169  0.7348484
##   3                   50      0.9601786  0.1813499  0.7418423
##   3                  100      0.9535586  0.1881617  0.7358058
##   3                  150      0.9542481  0.1871194  0.7357722
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

Train Support Machine Vector to predict imdb_score

svm_fit <- train(imdb_score ~ country + content_rating + color + title_year + 
    language + facenumber_in_poster + gross + dicaprio + hanks + 
    eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
    driver + ford + spacy + nolan + tarantino + cameron + jackson + 
    fincher + scorsese + anderson + Greengrass,
                 data = movie_df,
                 trControl = fit_control,
                 method = "svmRadial",
                 verbose = F)

print(svm_fit)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3656 samples
##   26 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3291, 3291, 3290, 3289, 3290, 3290, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared     MAE      
##   0.25  1.060589  0.002779784  0.8141206
##   0.50  1.060359  0.002658155  0.8141450
##   1.00  1.059011  0.002070716  0.8143338
## 
## Tuning parameter 'sigma' was held constant at a value of 2.26418e-14
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 2.26418e-14 and C = 1.

Train Decision Tree to predict imdb_score

dtree_fit <- train(imdb_score ~ country + content_rating + color + title_year + 
    language + facenumber_in_poster + gross + dicaprio + hanks + 
    eastwood + hardy + rickman + cumberbatch + bale + hoffman + 
    driver + ford + spacy + nolan + tarantino + cameron + jackson + 
    fincher + scorsese + anderson + Greengrass, 
            data = movie_df,
            method = "rpart",
            trControl = fit_control,
            tuneLength = 20,
            parms=list(split='information'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
print(dtree_fit)
## CART 
## 
## 3656 samples
##   26 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3291, 3290, 3290, 3291, 3291, 3290, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE       Rsquared    MAE      
##   0.001738798  0.9907022  0.13278909  0.7630285
##   0.002068381  0.9820585  0.14275199  0.7585432
##   0.002131491  0.9806695  0.14487365  0.7577790
##   0.002342080  0.9768093  0.14983026  0.7539785
##   0.002369088  0.9768093  0.14983026  0.7539785
##   0.002390306  0.9764477  0.15020232  0.7540550
##   0.002431604  0.9764581  0.15019392  0.7540253
##   0.002551606  0.9755093  0.15163191  0.7535363
##   0.003467947  0.9743540  0.15318537  0.7546189
##   0.003775077  0.9742853  0.15326460  0.7545389
##   0.004995151  0.9786091  0.14613286  0.7595301
##   0.005116126  0.9791663  0.14516994  0.7602984
##   0.005492177  0.9803517  0.14313246  0.7611767
##   0.006169864  0.9825830  0.13899321  0.7634161
##   0.009904726  0.9871068  0.13034482  0.7648497
##   0.010872825  0.9894890  0.12620686  0.7667665
##   0.013958926  0.9996821  0.10838720  0.7748752
##   0.032847341  1.0160793  0.07906620  0.7891151
##   0.035907783  1.0288863  0.05627205  0.7984240
##   0.041406660  1.0515144  0.02531107  0.8166404
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.003775077.
# Might need to install package rpart.plot
# Output the final model for the decision tree
rpart.plot::rpart.plot(dtree_fit$finalModel)

Summarize Findings

Resample and show results to find the best model

results <- resamples(list(linear = multi_fit, GBM = gbm_fit, SVM = svm_fit, rpart = dtree_fit))

summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: linear, GBM, SVM, rpart 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## linear 0.6721460 0.7182493 0.7450640 0.7377693 0.7696409 0.7801051    0
## GBM    0.6852720 0.7185370 0.7418544 0.7348484 0.7505480 0.7731188    0
## SVM    0.7855687 0.7997161 0.8086314 0.8143338 0.8321575 0.8438801    0
## rpart  0.7133329 0.7454524 0.7529411 0.7545389 0.7681900 0.8044121    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## linear 0.8577889 0.9273996 0.9568565 0.9511951 0.9912828 1.015070    0
## GBM    0.8641010 0.9306892 0.9549358 0.9522526 0.9836367 1.011582    0
## SVM    1.0061206 1.0242484 1.0477408 1.0590112 1.0956613 1.121755    0
## rpart  0.8992071 0.9548304 0.9794503 0.9742853 0.9970306 1.026799    0
## 
## Rsquared 
##                Min.      1st Qu.       Median        Mean    3rd Qu.
## linear 1.073202e-01 0.1763706475 0.2036895384 0.199329762 0.23621639
## GBM    1.301923e-01 0.1642848221 0.2090832055 0.190616876 0.21689205
## SVM    1.043988e-05 0.0001405843 0.0009386559 0.002070716 0.00307456
## rpart  1.034697e-01 0.1475240234 0.1562923170 0.153264596 0.16885243
##               Max. NA's
## linear 0.253053285    0
## GBM    0.226620546    0
## SVM    0.009224642    1
## rpart  0.192979345    0
  1. Linear Model with a 10-fold cross validation: RMSE: 0.9511951 R2: 0.1993298
  2. GBM with a 10-fold cross validation: RMSE: 0.9522526 R2: 0.1906169
  3. Decision Tree with a 10-fold cross validation: RMSE: 0.9756082 R2: 0.15047172
  4. SVM (Radial Kernal) with a 10-fold cross validation: RMSE: 1.059011 R2: 0.002070716

The Linear Model with a 10-fold cross validation has the best performance. It was the lowest RMSE and the highest R2.

According to the linear model, there is a significantly negative effect of having a film in color. This may relate to how newer films are now in color. When you consider that films have slightly lowered in their average IMDb score over time, it may not be ideal to recommend that film producers shoot their films in grayscale if they want to improve performance.

The decision tree can give better actionable steps to improve the IMDb rating for a film to be released in the future. In general, movies that are released in the UK, aren’t produced in color, are rated R, and aren’t in English tend to have higher IMDb scores. Movies with a higher gross tend to have the highest IMDb scores. There are certain actors and directors who consistently perform well and are featured in films with high ratings. It may be recommended to consider any of these factors for a film producer who wants to improve their movie’s critical response.